Extract Content (Web Mining)

Synopsis
Extracts content from an HTML document.
Description
This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
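
The length filter described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the function name `filter_text_blocks`, the default of 5 words, and the whitespace-based word count are all assumptions made for illustration.

```python
def filter_text_blocks(blocks, minimum_text_block_length=5):
    """Keep only blocks with at least the given number of words.

    Words are counted by splitting on whitespace, which is an assumption;
    the operator's actual tokenization is not specified here.
    """
    return [b for b in blocks if len(b.split()) >= minimum_text_block_length]


# Hypothetical text blocks, as they might come out of an HTML page.
blocks = [
    "Home",       # navigation entry, 1 word
    "About us",   # navigation entry, 2 words
    "This operator extracts textual content from a given HTML document.",
]

kept = filter_text_blocks(blocks)
print(kept)
```

With these inputs, only the full sentence survives; the two navigation entries fall below the minimum length and are dropped.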
Input
document: The document port.
Output
document: The document port.
Parameters
- extract content: Specifies whether content is extracted or not.
- minimum text block length: The minimum length (in words/tokens) of text blocks.
- override content type information: Specifies whether any existing content type and encoding information should be overridden using the HTML meta http-equiv tag.
- neglegt span tags: Specifies whether <span> tags should be neglected or used as a text block divider.
- neglect p tags: Specifies whether <p> tags should be neglected or used as a text block divider.
- neglect b tags: Specifies whether <b> tags should be neglected or used as a text block divider.
- neglect i tags: Specifies whether <i> tags should be neglected or used as a text block divider.
- neglect br tags: Specifies whether <br> tags should be neglected or used as a text block divider.
- ignore non html tags: Specifies whether tags that are not common HTML should be ignored.
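
To show what neglecting a tag versus using it as a text block divider means in practice, here is a rough Python sketch built on the standard library's `html.parser`. It is an illustration under assumptions, not the operator's code: the class `BlockSplitter`, the helper `split_blocks`, and the set of tags that always divide blocks are hypothetical.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    # Assumption: these tags always end a text block in this sketch.
    ALWAYS_DIVIDE = {"div", "td", "li", "h1", "h2", "h3"}
    # Tags whose role is controlled by the "neglect ... tags" parameters.
    OPTIONAL = {"span", "p", "b", "i", "br"}

    def __init__(self, neglected=()):
        super().__init__()
        # A neglected tag is removed from the divider set.
        self.dividers = self.ALWAYS_DIVIDE | (self.OPTIONAL - set(neglected))
        self.blocks = []
        self._current = []

    def _flush(self):
        # Collapse whitespace and store the block if it is not empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.dividers:
            self._flush()
        elif tag == "br":
            # A neglected line break still separates words.
            self._current.append(" ")

    def handle_endtag(self, tag):
        if tag in self.dividers:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html, neglected=()):
    parser = BlockSplitter(neglected)
    parser.feed(html)
    parser.close()
    return parser.blocks


html = "<div>Text mining <b>extracts</b> useful content<br>from web pages</div>"

all_dividers = split_blocks(html)
neglect_b = split_blocks(html, neglected={"b"})
neglect_b_br = split_blocks(html, neglected={"b", "br"})
print(all_dividers)
print(neglect_b)
print(neglect_b_br)
```

When every tag acts as a divider, the sentence is cut into four short fragments, which a minimum text block length would likely discard. Neglecting <b> and <br> keeps it together as a single block.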